import pandas as pd
import numpy as np
from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
import plotly.express as px
from wordcloud import WordCloud
import pycountry
A startup is a newly established company or project founded by an entrepreneur to explore innovative ideas and create a scalable business model. The success of a startup is highly uncertain and comes with significant risks. Many startups face the possibility of failure due to intense competition, market fluctuations, financial challenges, and other obstacles. However, a minority of startups do manage to overcome these hurdles and achieve remarkable success, becoming influential players in their respective industries. The journey of a startup is characterized by its adventurous spirit, where the pursuit of success is accompanied by the acknowledgment of potential failure, making it a challenging yet rewarding endeavor for ambitious entrepreneurs.
im = Image.open("assets/Startup-Funding-Series.jpg")
display(im)
Startups have a profound impact on the economy, fostering innovation, job creation, and technological advancements. These entrepreneurial ventures inject fresh ideas and disrupt traditional industries, driving competition and spurring economic growth. However, startups also entail inherent risks, as many of them face failure, leading to potential financial losses and missed opportunities. Predicting whether a startup will succeed or fail is crucial for several reasons. Firstly, it allows investors and stakeholders to make informed decisions, allocating resources efficiently. Secondly, predicting success or failure can help guide government policies and support programs to nurture promising startups and mitigate risks. Additionally, understanding the factors that contribute to success or failure aids entrepreneurs in making strategic adjustments and increasing their chances of building sustainable and influential businesses, ultimately contributing to the overall health and vibrancy of the economy.
The task at hand involves predictiong the likelihood of success or failure for a given startup. The challenge is to develop a comprehensive and accurate predictive model that takes into account various factors and variables influencing startup outcomes.
df = pd.read_csv("data_input/investments_VC.csv", encoding = 'unicode_escape')
df.rename(columns = {' market ': 'market', ' funding_total_usd ': 'funding_total_usd'}, inplace = True)
df.head()
| permalink | name | homepage_url | category_list | market | funding_total_usd | status | country_code | state_code | region | ... | secondary_market | product_crowdfunding | round_A | round_B | round_C | round_D | round_E | round_F | round_G | round_H | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /organization/waywire | #waywire | http://www.waywire.com | |Entertainment|Politics|Social Media|News| | News | 17,50,000 | acquired | USA | NY | New York City | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | /organization/tv-communications | &TV Communications | http://enjoyandtv.com | |Games| | Games | 40,00,000 | operating | USA | CA | Los Angeles | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | /organization/rock-your-paper | 'Rock' Your Paper | http://www.rockyourpaper.org | |Publishing|Education| | Publishing | 40,000 | operating | EST | NaN | Tallinn | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | /organization/in-touch-network | (In)Touch Network | http://www.InTouchNetwork.com | |Electronics|Guides|Coffee|Restaurants|Music|i... | Electronics | 15,00,000 | operating | GBR | NaN | London | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | /organization/r-ranch-and-mine | -R- Ranch and Mine | NaN | |Tourism|Entertainment|Games| | Tourism | 60,000 | operating | USA | TX | Dallas | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 39 columns
im2 = Image.open("assets/Graph-Showing-The-Various-Stages-Of-Startup-Investment-PowerPoint-Slides.png")
display(im2)
im3 = Image.open("assets/Medium.png")
display(im3)
df.shape
(54294, 39)
CATEGORICAL = list(filter(lambda x: df[x].dtype=='object',df.columns))
def show_stats(columns):
stat = {}
for col in columns:
stat[col] = [df[col].nunique(),
df[col].isna().mean()*100,
df[col].dtype]
return pd.DataFrame.from_dict(stat, orient='index', columns=['Unique', 'Missed values in %', 'DType'])
show_stats(CATEGORICAL)
| Unique | Missed values in % | DType | |
|---|---|---|---|
| permalink | 49436 | 8.943898 | object |
| name | 49350 | 8.945740 | object |
| homepage_url | 45850 | 15.296350 | object |
| category_list | 16675 | 16.239363 | object |
| market | 753 | 16.252256 | object |
| funding_total_usd | 14617 | 8.943898 | object |
| status | 3 | 11.364055 | object |
| country_code | 115 | 18.655837 | object |
| state_code | 61 | 44.448742 | object |
| region | 1089 | 18.655837 | object |
| city | 4188 | 20.208494 | object |
| founded_at | 3369 | 28.990312 | object |
| founded_month | 420 | 29.122923 | object |
| founded_quarter | 218 | 29.122923 | object |
| first_funding_at | 3914 | 8.943898 | object |
| last_funding_at | 3657 | 8.943898 | object |
State code column can be dropped as it doesn't really play a big role, but has too much missed values We can remove everything connected with links, as there are too many unique values for them. Also remove category list, because we already have Market column
df = df.drop('state_code', axis=1)
df = df.drop(['permalink', 'homepage_url', 'category_list'], axis=1)
funding_total_usd should be converted from str to int
df['funding_total_usd'].head(2)
0 17,50,000 1 40,00,000 Name: funding_total_usd, dtype: object
df.drop(df[df['funding_total_usd'] == ' - '].index,inplace=True)
df['funding_total_usd'] = df['funding_total_usd'].str.replace(",","")
df.dropna( axis=0,subset=['funding_total_usd'],inplace = True) and df.dropna( axis=0,subset=['market'],inplace = True)
df['funding_total_usd']=df['funding_total_usd'].astype(np.int64)
marketdf = df[['market','name','funding_total_usd','status']]
show_stats(list(filter(lambda x: df[x].dtype =='float64', df.columns)))
| Unique | Missed values in % | DType | |
|---|---|---|---|
| funding_rounds | 17 | 0.000000 | float64 |
| founded_year | 101 | 21.443763 | float64 |
| seed | 3337 | 0.000000 | float64 |
| venture | 9300 | 0.000000 | float64 |
| equity_crowdfunding | 252 | 0.000000 | float64 |
| undisclosed | 687 | 0.000000 | float64 |
| convertible_note | 299 | 0.000000 | float64 |
| debt_financing | 1872 | 0.000000 | float64 |
| angel | 999 | 0.000000 | float64 |
| grant | 532 | 0.000000 | float64 |
| private_equity | 847 | 0.000000 | float64 |
| post_ipo_equity | 239 | 0.000000 | float64 |
| post_ipo_debt | 57 | 0.000000 | float64 |
| secondary_market | 20 | 0.000000 | float64 |
| product_crowdfunding | 176 | 0.000000 | float64 |
| round_A | 2035 | 0.000000 | float64 |
| round_B | 1269 | 0.000000 | float64 |
| round_C | 740 | 0.000000 | float64 |
| round_D | 458 | 0.000000 | float64 |
| round_E | 225 | 0.000000 | float64 |
| round_F | 110 | 0.000000 | float64 |
| round_G | 32 | 0.000000 | float64 |
| round_H | 5 | 0.000000 | float64 |
df = df.dropna()
df = df.drop_duplicates()
print('Length after cleanup', len(df))
Length after cleanup 27970
def code_to_country(text: str):
if text == 'ROM':
text = 'ROU'
return pycountry.countries.get(alpha_3=text).name
df['country'] = df['country_code'].apply(code_to_country)
df.drop('country_code', axis=1, inplace=True)
df['country'].head()
0 United States 2 Estonia 3 United Kingdom 4 United States 10 Chile Name: country, dtype: object
df.describe()
| funding_total_usd | funding_rounds | founded_year | seed | venture | equity_crowdfunding | undisclosed | convertible_note | debt_financing | angel | ... | secondary_market | product_crowdfunding | round_A | round_B | round_C | round_D | round_E | round_F | round_G | round_H | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.797000e+04 | 27970.000000 | 27970.000000 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | ... | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 | 2.797000e+04 |
| mean | 1.774500e+07 | 2.007866 | 2007.064140 | 3.044336e+05 | 1.065874e+07 | 7.579419e+03 | 1.051359e+05 | 3.182541e+04 | 2.595650e+06 | 8.701276e+04 | ... | 4.062824e+04 | 1.000055e+04 | 1.686132e+06 | 2.188271e+06 | 1.860551e+06 | 1.196603e+06 | 5.584919e+05 | 2.877252e+05 | 9.824156e+04 | 2.499106e+04 |
| std | 1.955124e+08 | 1.509339 | 7.380901 | 1.077503e+06 | 3.572890e+07 | 1.960442e+05 | 2.357928e+06 | 1.838252e+06 | 1.819967e+08 | 6.047256e+05 | ... | 3.100436e+06 | 5.581497e+05 | 5.953910e+06 | 9.238159e+06 | 1.009298e+07 | 1.283717e+07 | 7.013485e+06 | 8.319132e+06 | 6.966990e+06 | 3.611924e+06 |
| min | 3.000000e+01 | 1.000000 | 1902.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 4.500000e+05 | 1.000000 | 2005.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 2.276304e+06 | 1.000000 | 2009.000000 | 0.000000e+00 | 9.295685e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 1.100000e+07 | 2.000000 | 2011.000000 | 1.286378e+05 | 8.698066e+06 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| max | 3.007950e+10 | 18.000000 | 2014.000000 | 1.000000e+08 | 2.351000e+09 | 1.700000e+07 | 2.508000e+08 | 3.000000e+08 | 3.007950e+10 | 3.025439e+07 | ... | 4.000000e+08 | 7.200000e+07 | 2.250000e+08 | 5.420000e+08 | 4.900000e+08 | 1.200000e+09 | 4.000000e+08 | 1.060000e+09 | 1.000000e+09 | 6.000000e+08 |
8 rows × 24 columns
print('Mean total funding (in $):',df['funding_total_usd'].mean())
print('Median total funding (in $):',df['funding_total_usd'].median())
print('Mean year founded:',round(df['founded_year'].mean()))
print('The oldest company: ', int(min(df['founded_year'])))
print('The latest company: ', int(max(df['founded_year'])))
print('Median seed (first official investment round): ', df['seed'].std())
Mean total funding (in $): 17744997.722130854 Median total funding (in $): 2276304.0 Mean year founded: 2007 The oldest company: 1902 The latest company: 2014 Median seed (first official investment round): 1077502.5648221637
top_spheres = df['market'].value_counts()[:10]
fig = px.pie(values = top_spheres,
names = top_spheres.index,
title = 'Top 10 most expensive markets');
fig.show()
plt.figure(figsize=(28,8))
sns.countplot(x = 'market',
data = marketdf,
order = marketdf['market'].value_counts().iloc[:15].index)
plt.xlabel('Market')
plt.ylabel('Counts')
plt.title('Amount of startup by Markets');
mark=marketdf['market']
mark.dropna(inplace=True)
wordcloud = WordCloud(background_color = 'white',
width = 5000,
height = 3000).generate(" ".join(mark))
plt.figure(figsize=(20,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
plt.show()
df_filtered = df[df['status'] != 'operating']
fig = px.histogram(df_filtered['status'],
title='Startups status')
fig.update_xaxes(categoryorder = 'total ascending')
fig.show()
From the above result, we can see that most of the companies either end up being acquired or succeed since their inception as startups.
rounds = {}
for i in df.columns:
if 'round_' in i:
rounds[i] = df[i].mean()/10**6
fig = px.bar(x = rounds.keys(),
y = rounds.values(),
labels = {'y': 'Investment ($ million)',
'x': 'Rounds'});
fig.show()
Each round, a startup companies get less and less funding - because it can make profit for it self.
fig = px.histogram(df,
x = 'founded_year',
labels = {'founded_year': 'Founded year'},
title = 'Startups founded years distribution')
fig.show()
Most of the startups were founded in the 2000s. The peak of startup establishments occurred in the year 2012, with a total of 3,534 startups founded.
fig = px.histogram(x = ['angel', 'grant', 'venture'],
y = [len(df[df['angel']!=0]),
len(df[df['grant']!=0]),
len(df[df['venture']!=0])],
title = 'Startups invesetment sources');
fig.update_xaxes(categoryorder = 'total ascending')
fig.show()
This is obvious, because venture capitals are porvided by professional investors and give startups way more funding.
top_countries = df['country'].value_counts()[:10]
top_countries_df = top_countries.reset_index()
top_countries_df.columns = ['Country', 'Count']
top_countries_df = top_countries_df.sort_values(by='Count', ascending=True)
top_countries_df
| Country | Count | |
|---|---|---|
| 9 | Ireland | 210 |
| 8 | Spain | 354 |
| 7 | Germany | 473 |
| 5 | Israel | 476 |
| 6 | China | 476 |
| 4 | India | 500 |
| 3 | France | 544 |
| 2 | Canada | 861 |
| 1 | United Kingdom | 1574 |
| 0 | United States | 19097 |
fig = px.bar(top_countries_df,
x = 'Count',
y = 'Country',
orientation = 'h',
title = 'Top 10 Countires with the Most Startups',
labels = {'Count': 'Number of startups: ',
'Country': 'Country: '})
fig.show()
As we see, US is the most popoular region for start a startup journey.
df.head()
| name | market | funding_total_usd | status | region | city | funding_rounds | founded_at | founded_month | founded_quarter | ... | product_crowdfunding | round_A | round_B | round_C | round_D | round_E | round_F | round_G | round_H | country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | #waywire | News | 1750000 | acquired | New York City | New York | 1.0 | 2012-06-01 | 2012-06 | 2012-Q2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | United States |
| 2 | 'Rock' Your Paper | Publishing | 40000 | operating | Tallinn | Tallinn | 1.0 | 2012-10-26 | 2012-10 | 2012-Q4 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Estonia |
| 3 | (In)Touch Network | Electronics | 1500000 | operating | London | London | 1.0 | 2011-04-01 | 2011-04 | 2011-Q2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | United Kingdom |
| 4 | -R- Ranch and Mine | Tourism | 60000 | operating | Dallas | Fort Worth | 2.0 | 2014-01-01 | 2014-01 | 2014-Q1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | United States |
| 10 | 1,2,3 Listo | E-Commerce | 40000 | operating | Santiago | Las Condes | 1.0 | 2012-01-01 | 2012-01 | 2012-Q1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Chile |
5 rows × 35 columns
top_funding = df[['country','funding_total_usd']]
top_funding = top_funding.groupby('country')['funding_total_usd'].sum().reset_index()
top_funding = top_funding.sort_values(by = 'funding_total_usd', ascending = False).head(10)
top_funding
| country | funding_total_usd | |
|---|---|---|
| 100 | United States | 385898175119 |
| 20 | China | 21937000804 |
| 99 | United Kingdom | 16358468613 |
| 42 | India | 11596801424 |
| 17 | Canada | 9957982897 |
| 34 | Germany | 6855077419 |
| 45 | Israel | 5402961851 |
| 33 | France | 3750926917 |
| 87 | Spain | 3341128543 |
| 64 | Netherlands | 2969230977 |
fig = px.bar(top_funding,
x = 'funding_total_usd',
y = 'country',
title = 'Top 10 countries by total funding',
labels = {'x': 'Total fund (4 billions)',
'y': 'Country: '
})
fig.show()
The answer is yes. Unites states is also the biggest funded startups.
how about the companies?
top_companies = df.sort_values(by=['funding_total_usd'], ascending = False) [:15]
fig = px.bar(data_frame = top_companies,
y = 'funding_total_usd',
x = 'name',
title = 'Top 15 most succesfull companies',
labels = {'x': 'Total funding: ',
'y': 'Company: '}
);
fig.show()
fig = px.imshow(df.corr());
fig.show();
C:\Users\KeisezrG\AppData\Local\Temp\ipykernel_9720\1654500210.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
The Correlation between debt_financing and funding_total_usd. Let's inspect more about that variable.
df_filtered = df[(df['debt_financing']>0)&(df['funding_total_usd']/10**9 < 30)]
sns.scatterplot(data = df_filtered,
x = 'funding_total_usd',
y = 'debt_financing',
marker = 'o'
)
sns.regplot(data=df_filtered,
x='funding_total_usd',
y='debt_financing',
scatter=False)
plt.xlabel('Funding Total (USD)')
plt.ylabel('Debt Financing')
plt.title('Scatter Plot without Outliers')
plt.show()
The scatter plot indicates the more money you need for the startup, the more debt_financing you'll need.
Hypothesis: the seed we raise depends on the founded quater
are there any 'good' quaters to start first fund raising?
df['founded_quarter'].head()
0 2012-Q2 2 2012-Q4 3 2011-Q2 4 2014-Q1 10 2012-Q1 Name: founded_quarter, dtype: object
Extract quater into separate solumn first
def separate_quarter(text: str):
return text.split('-')[-1]
df['quater'] = df['founded_quarter'].apply(separate_quarter)
df_by_q_sort = df.sort_values(by='quater')
num_seed = []
for q in df_by_q_sort['quater'].unique():
num_seed.append(len(df_by_q_sort[(df_by_q_sort['quater']==q)&(df_by_q_sort['seed'] > 0)]))
fig = px.histogram(
y=num_seed,
x=df_by_q_sort['quater'].unique(),
title='Quater x Seed popularity plot',
color_discrete_sequence=["teal"],
labels={'x': 'Quater the startup was founded at',
'y': 'Number of seeds raised'})
fig.show()
The biggest seed raised is Q1 and Q1 is also the most popular option among investors. The visualization above suggests that the first quarter is the most popular time for startups to secure their initial funding, supporting the validity of the hypothesis. Therefore, it is advisable to apply for seed funding during that period.